Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras
Visual scene understanding is an important capability that enables robots to
purposefully act in their environment. In this paper, we propose a novel
approach to object-class segmentation from multiple RGB-D views using deep
learning. We train a deep neural network to predict object-class semantics that
are consistent across several viewpoints in a semi-supervised way. At test time,
the semantic predictions of our network can be fused more consistently in
semantic keyframe maps than predictions of a network trained on individual
views. We base our network architecture on a recent single-view deep learning
approach to RGB and depth fusion for semantic object-class segmentation and
enhance it with multi-scale loss minimization. We obtain the camera trajectory
using RGB-D SLAM and warp the predictions of RGB-D images into ground-truth
annotated frames in order to enforce multi-view consistency during training. At
test time, predictions from multiple views are fused into keyframes. We propose
and analyze several methods for enforcing multi-view consistency during
training and testing. We evaluate the benefit of multi-view consistency
training and demonstrate that pooling of deep features and fusion over multiple
views outperforms single-view baselines on the NYUDv2 benchmark for semantic
segmentation. Our end-to-end trained network achieves state-of-the-art
performance on the NYUDv2 dataset in single-view segmentation as well as
multi-view semantic fusion. Comment: the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2017).
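A minimal sketch of the test-time fusion idea described above, assuming the per-view class probabilities have already been warped into the keyframe (e.g. with depth and the RGB-D SLAM poses). Masked averaging is just one of the several fusion strategies the paper analyzes, and the array shapes are illustrative.

```python
import numpy as np

def fuse_into_keyframe(warped_probs, valid_masks):
    """Fuse softmax class probabilities from several views into one keyframe.

    warped_probs: [V, H, W, C] per-view probabilities, already warped into the
                  keyframe; valid_masks: [V, H, W] bools marking pixels whose
                  warp is valid (visible and in-bounds).
    """
    w = valid_masks[..., None].astype(np.float32)                   # [V, H, W, 1]
    fused = (warped_probs * w).sum(0) / np.clip(w.sum(0), 1e-6, None)
    return fused.argmax(-1), fused                                  # label map, fused probs

# Toy usage: three views, 4 classes, 8x8 keyframe.
probs = np.random.dirichlet(np.ones(4), size=(3, 8, 8)).astype(np.float32)
masks = np.random.rand(3, 8, 8) > 0.2
labels, _ = fuse_into_keyframe(probs, masks)
```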
EgoHumans: An Egocentric 3D Multi-Human Benchmark
We present EgoHumans, a new multi-view multi-human video benchmark to advance
the state-of-the-art of egocentric human 3D pose estimation and tracking.
Existing egocentric benchmarks either capture a single subject or indoor-only
scenarios, which limits the generalization of computer vision algorithms for
real-world applications. We propose a novel 3D capture setup to construct a
comprehensive egocentric multi-human benchmark in the wild with annotations to
support diverse tasks such as human detection, tracking, 2D/3D pose estimation,
and mesh recovery. We leverage consumer-grade wearable camera-equipped glasses
for the egocentric view, which enables us to capture dynamic activities like
playing tennis, fencing, volleyball, etc. Furthermore, our multi-view setup
generates accurate 3D ground truth even under severe or complete occlusion. The
dataset consists of more than 125k egocentric images, spanning diverse scenes
with a particular focus on challenging and unchoreographed multi-human
activities and fast-moving egocentric views. We rigorously evaluate existing
state-of-the-art methods and highlight their limitations in the egocentric
scenario, specifically on multi-human tracking. To address such limitations, we
propose EgoFormer, a novel approach with a multi-stream transformer
architecture and explicit 3D spatial reasoning to estimate and track the human
pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 on the
EgoHumans dataset. Comment: Accepted to ICCV 2023 (Oral).
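A small sketch of how a 3D joint can be triangulated from its 2D detections in several calibrated views, in the spirit of the multi-view ground-truth generation described above; this is a standard DLT solve rather than the authors' exact pipeline, and the inputs are hypothetical.

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Least-squares triangulation of one joint from V calibrated views.

    proj_mats: list of V 3x4 camera projection matrices P = K [R | t];
    points_2d: [V, 2] pixel detections of the same joint.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])       # each view contributes two linear
        rows.append(v * P[2] - P[1])       # constraints on the homogeneous point
    A = np.stack(rows)                      # [2V, 4]
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                     # de-homogenise to a 3D point
```

Triangulating every joint this way, from views where it is visible, yields 3D annotations even when a subject is occluded in the egocentric view itself.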
In-Hand 3D Object Scanning from an RGB Sequence
We propose a method for in-hand 3D scanning of an unknown object with a
monocular camera. Our method relies on a neural implicit surface representation
that captures both the geometry and the appearance of the object; however, in
contrast with most NeRF-based methods, we do not assume that the camera-object
relative poses are known. Instead, we simultaneously optimize both the object
shape and the pose trajectory. As direct optimization over all shape and pose
parameters is prone to fail without coarse-level initialization, we propose an
incremental approach that starts by splitting the sequence into carefully
selected overlapping segments within which the optimization is likely to
succeed. We reconstruct the object shape and track its poses independently
within each segment, then merge all the segments before performing a global
optimization. We show that our method is able to reconstruct the shape and color of both textured and challenging texture-less objects, that it outperforms classical methods that rely only on appearance features, and that its performance is close to recent methods that assume known camera poses. Comment: CVPR 2023.
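As a rough illustration of the incremental strategy above, the helper below splits a frame sequence into fixed overlapping windows; the actual method selects segment boundaries carefully rather than uniformly, so the length and overlap values here are hypothetical.

```python
def split_into_segments(num_frames: int, seg_len: int = 30, overlap: int = 10):
    """Uniform overlapping segments over frame indices 0..num_frames-1."""
    step = seg_len - overlap
    segments, start = [], 0
    while True:
        end = min(start + seg_len, num_frames)
        segments.append(list(range(start, end)))
        if end >= num_frames:
            return segments
        start += step

# e.g. 100 frames -> [0..29], [20..49], ..., [80..99], each sharing a 10-frame overlap
print(len(split_into_segments(100)))
```

Shape and poses are then reconstructed per segment, and the shared frames let neighbouring segments be aligned and merged before the global optimization.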
Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation, Tracking, and Forecasting on a Video Snippet
Multi-person pose understanding from RGB videos involves three complex tasks:
pose estimation, tracking and motion forecasting. Intuitively, accurate
multi-person pose estimation facilitates robust tracking, and robust tracking
builds crucial history for correct motion forecasting. Most existing works
either focus on a single task or employ multi-stage approaches that solve
multiple tasks separately, which tend to make sub-optimal decisions at each
stage and fail to exploit correlations among the three tasks. In this
paper, we propose Snipper, a unified framework to perform multi-person 3D pose
estimation, tracking, and motion forecasting simultaneously in a single stage.
We propose an efficient yet powerful deformable attention mechanism to
aggregate spatiotemporal information from the video snippet. Building upon this
deformable attention, a video transformer is learned to encode the
spatiotemporal features from the multi-frame snippet and to decode informative
pose features for multi-person pose queries. Finally, these pose queries are
regressed to predict multi-person pose trajectories and future motions in a
single shot. In the experiments, we show the effectiveness of Snipper on three
challenging public datasets where our generic model rivals specialized
state-of-the-art baselines for pose estimation, tracking, and forecasting.
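A compact, single-scale sketch of the kind of deformable attention described above: each pose query predicts a few sampling offsets around a reference location, samples the feature map there, and aggregates the samples with learned weights. The shapes, offset scale, and single-frame setting are assumptions for illustration; Snipper's actual module aggregates multi-frame spatiotemporal features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Toy single-scale deformable attention over one feature map."""
    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)   # (dx, dy) per sample
        self.weight_head = nn.Linear(dim, num_points)        # attention weights
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_map):
        # queries: [B, Q, C]; ref_points: [B, Q, 2] in [-1, 1]; feat_map: [B, C, H, W]
        B, Q, _ = queries.shape
        offsets = self.offset_head(queries).view(B, Q, self.num_points, 2)
        weights = self.weight_head(queries).softmax(dim=-1)             # [B, Q, K]
        locs = (ref_points.unsqueeze(2) + 0.1 * offsets).clamp(-1, 1)   # [B, Q, K, 2]
        sampled = F.grid_sample(feat_map, locs, align_corners=True)     # [B, C, Q, K]
        sampled = self.value_proj(sampled.permute(0, 2, 3, 1))          # [B, Q, K, C]
        agg = (weights.unsqueeze(-1) * sampled).sum(dim=2)              # [B, Q, C]
        return self.out_proj(agg)

# Toy usage: 2 pose queries attending into an 8x8 feature map with 64 channels.
m = DeformableSampling(dim=64)
out = m(torch.randn(1, 2, 64), torch.zeros(1, 2, 2), torch.randn(1, 64, 8, 8))  # [1, 2, 64]
```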
FroDO: From Detections to 3D Objects
Object-oriented maps are important for scene understanding since they jointly
capture geometry and semantics, and allow instantiation of and meaningful
reasoning about individual objects. We introduce FroDO, a method for accurate 3D
reconstruction of object instances from RGB video that infers object location,
pose and shape in a coarse-to-fine manner. Key to FroDO is to embed object
shapes in a novel learnt space that allows seamless switching between sparse
point cloud and dense DeepSDF decoding. Given an input sequence of localized
RGB frames, FroDO first aggregates 2D detections to instantiate a
category-aware 3D bounding box per object. A shape code is regressed using an
encoder network before optimizing shape and pose further under the learnt shape
priors using sparse and dense shape representations. The optimization uses
multi-view geometric, photometric and silhouette losses. We evaluate on
real-world datasets, including Pix3D, Redwood-OS, and ScanNet, for single-view,
multi-view, and multi-object reconstruction. Comment: To be published in CVPR 2020. The first two authors contributed equally.
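A minimal sketch of the multi-view geometric term mentioned above: sparse object points are projected into each localized frame and compared against their 2D observations. The pinhole model and per-view mean error are assumptions for illustration; the photometric and silhouette terms are analogous image-space comparisons.

```python
import numpy as np

def reprojection_error(points_3d, K, R, t, observed_2d):
    """Mean pixel reprojection error of object points in a single view.

    points_3d: [N, 3] world-frame points; K: 3x3 intrinsics;
    R, t: world-to-camera rotation and translation; observed_2d: [N, 2] pixels.
    """
    cam = points_3d @ R.T + t               # [N, 3] points in the camera frame
    proj = cam @ K.T                         # pinhole projection
    uv = proj[:, :2] / proj[:, 2:3]          # [N, 2] predicted pixel coordinates
    return np.linalg.norm(uv - observed_2d, axis=1).mean()

# Summing this term over all localized frames gives a multi-view geometric loss
# that can be minimised with respect to the object pose and shape-code parameters.
```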
Multi-volume mapping and tracking for real-time RGB-D sensing
In recent years, much research has been devoted to real-time dense mapping and tracking techniques due to the availability of low-cost RGB-D cameras. In this paper, we present a novel multi-volume mapping and tracking algorithm that generates photo-realistic maps while maintaining accurate and robust camera tracking. The algorithm deploys one small volume of high voxel resolution to obtain detailed maps of near-field objects, while utilizing another large volume of low voxel resolution to increase tracking robustness by including far-field scenes. The experimental results show that our multi-volume processing scheme achieves an objective quality gain of 2 dB in PSNR and 0.2 in SSIM. Our approach is capable of real-time sensing at approximately 30 fps and can be implemented on a modern GPU.
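A toy sketch of the two-volume idea described above: a small, fine-resolution TSDF volume around the near field and a larger, coarse volume for far-field structure, with depth samples routed to the finest volume that contains them. All extents, resolutions, and the routing rule are hypothetical; the actual system fuses depth into and tracks against both volumes on the GPU.

```python
import numpy as np

class TsdfVolume:
    """Axis-aligned TSDF grid with a given origin, side length, and voxel size."""
    def __init__(self, origin, side, voxel):
        self.origin = np.asarray(origin, dtype=np.float32)
        self.voxel = voxel
        self.dim = int(round(side / voxel))
        self.tsdf = np.ones((self.dim,) * 3, dtype=np.float32)     # truncated SDF values
        self.weight = np.zeros((self.dim,) * 3, dtype=np.float32)  # fusion weights

    def contains(self, p):
        rel = (np.asarray(p) - self.origin) / self.voxel
        return bool(np.all((rel >= 0) & (rel < self.dim)))

# Hypothetical configuration: a 1 m near-field cube at 1 cm voxels for detail,
# and an 8 m far-field cube at 8 cm voxels that keeps distant structure for tracking.
near = TsdfVolume(origin=(-0.5, -0.5, 0.2), side=1.0, voxel=0.01)
far  = TsdfVolume(origin=(-4.0, -4.0, 0.0), side=8.0, voxel=0.08)

def pick_volume(point):
    """Route a back-projected depth sample to the finest volume covering it."""
    return near if near.contains(point) else far
```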